
European Radiology

Springer Science and Business Media LLC

All preprints, ranked by how well they match European Radiology's content profile, based on 11 papers previously published here. The average preprint has a 0.14% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
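For orientation, the "N× avg" badges on the entries below appear to be each preprint's match score divided by this 0.14% baseline; that reading is an inference from the numbers shown on the page, not a documented formula of the ranking service. A quick sketch in Python:

    # Quick sketch relating the "N x avg" badges to the 0.14% baseline match
    # score. That badge = score / baseline is an inference from this page's
    # figures, not a documented formula of the ranking service.
    BASELINE_PCT = 0.14  # average match score for this journal, in percent

    def fold_over_average(match_score_pct: float) -> float:
        return match_score_pct / BASELINE_PCT

    print(f"{fold_over_average(35.0):.0f}x avg")  # a 35% match score -> "250x avg"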

1
A Comparative Study: Diagnostic Performance of ChatGPT 3.5, Google Bard, Microsoft Bing, and Radiologists in Thoracic Radiology Cases

Gunes, Y. C.; Cesur, T.

2024-01-20 radiology and imaging 10.1101/2024.01.18.24301495
Top 0.1%
250× avg

Purpose: To investigate and compare the diagnostic performance of ChatGPT 3.5, Google Bard, Microsoft Bing, and two board-certified radiologists in thoracic radiology cases published by the Society of Thoracic Radiology.
Materials and Methods: We collected 124 "Case of the Month" cases from the Society of Thoracic Radiology website, published between March 2012 and December 2023. Medical history and imaging findings were input into ChatGPT 3.5, Google Bard, and Microsoft Bing for diagnosis and differential diagnosis. Two board-certified radiologists provided their diagnoses. Cases were categorized anatomically (parenchyma, airways, mediastinum-pleura-chest wall, and vascular) and further classified as specific or non-specific for radiological diagnosis. Diagnostic accuracy and differential diagnosis scores were analyzed using chi-square, Kruskal-Wallis, and Mann-Whitney U tests.
Results: Among 124 cases, ChatGPT demonstrated the highest diagnostic accuracy (53.2%), outperforming the radiologists (52.4% and 41.1%), Bard (33.1%), and Bing (29.8%). Specific cases revealed varying diagnostic accuracies: Radiologist I achieved 65.6%, surpassing ChatGPT (63.5%), Radiologist II (52.0%), Bard (39.5%), and Bing (35.4%). ChatGPT 3.5 and Bing had higher differential diagnosis scores in specific cases (P < 0.05), whereas Bard did not (P = 0.114). All three had higher diagnostic accuracy in specific cases (P < 0.05). No differences were found in diagnostic accuracy or differential diagnosis scores across the four anatomical locations (P > 0.05).
Conclusion: ChatGPT 3.5 demonstrated higher diagnostic accuracy than Bing, Bard, and the radiologists in text-based thoracic radiology cases. Large language models hold great promise in this field under proper medical supervision.
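As a worked illustration of the kind of accuracy comparison this abstract reports, here is a minimal chi-square sketch. The counts are reconstructed from the reported percentages (53.2% and 29.8% of 124 cases); the 2×2 test shown is a generic choice, not the authors' exact analysis.

    # Minimal sketch: chi-square comparison of two readers' diagnostic accuracy.
    # Counts reconstructed from the abstract (53.2% vs. 29.8% of 124 cases);
    # a generic 2x2 test, not the authors' exact analysis.
    from scipy.stats import chi2_contingency

    n_cases = 124
    chatgpt_correct, bing_correct = 66, 37  # ~53.2% and ~29.8% of 124
    table = [
        [chatgpt_correct, n_cases - chatgpt_correct],
        [bing_correct, n_cases - bing_correct],
    ]
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, p = {p:.4f}")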

2
Empowering Radiologists with ChatGPT-4o: Comparative Evaluation of Large Language Models and Radiologists in Cardiac Cases

Cesur, T.; Gunes, Y. C.; Camur, E.; Dagli, M.

2024-06-25 radiology and imaging 10.1101/2024.06.25.24309247
Top 0.1%
249× avg

Purpose: This study evaluated the diagnostic accuracy and differential diagnosis capabilities of 12 large language models (LLMs), one cardiac radiologist, and three general radiologists in cardiac radiology. The impact of ChatGPT-4o assistance on radiologist performance was also investigated.
Materials and Methods: We collected 80 publicly available "Cardiac Case of the Month" cases from the Society of Thoracic Radiology website. LLMs and Radiologist III were provided with text-based information, whereas the other radiologists visually assessed the cases with and without ChatGPT-4o assistance. Diagnostic accuracy and differential diagnosis scores (DDx Score) were analyzed using the chi-square, Kruskal-Wallis, Wilcoxon, McNemar, and Mann-Whitney U tests.
Results: The unassisted diagnostic accuracy of the cardiac radiologist was 72.5%; of General Radiologist I, 53.8%; and of General Radiologist II, 51.3%. With ChatGPT-4o, accuracy improved to 78.8%, 70.0%, and 63.8%, respectively. The improvements for General Radiologists I and II were statistically significant (P ≤ 0.006). All radiologists' DDx Scores improved significantly with ChatGPT-4o assistance (P ≤ 0.05). Remarkably, Radiologist I's GPT-4o-assisted diagnostic accuracy and DDx Score were not significantly different from the cardiac radiologist's unassisted performance (P > 0.05). Among the LLMs, Claude 3.5 Sonnet and Claude 3 Opus had the highest accuracy (81.3%), followed by Claude 3 Sonnet (70.0%). On the DDx Score, Claude 3 Opus outperformed all models and Radiologist III (P < 0.05). The accuracy of General Radiologist III improved significantly from 48.8% to 63.8% with GPT-4o assistance (P < 0.001).
Conclusion: ChatGPT-4o may enhance the diagnostic performance of general radiologists for cardiac imaging, suggesting its potential as a valuable diagnostic support tool. Further research is required to assess its clinical integration.
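The unassisted-versus-assisted comparisons on the same 80 cases are paired, which is what McNemar's test (named in the abstract) handles. A minimal sketch follows; the discordant-pair counts are invented for illustration, since the abstract reports only P-values.

    # Hypothetical sketch of McNemar's test for paired unassisted vs. assisted
    # reads. The 2x2 counts are invented; the abstract reports only P-values.
    from statsmodels.stats.contingency_tables import mcnemar

    # Rows: unassisted correct / incorrect; columns: assisted correct / incorrect.
    table = [[40, 3],    # correct in both; correct unassisted only
             [16, 21]]   # correct assisted only; incorrect in both
    result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
    print(f"statistic = {result.statistic}, p = {result.pvalue:.4f}")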

3
Comparison of the diagnostic accuracy among GPT-4 based ChatGPT, GPT-4V based ChatGPT, and radiologists in musculoskeletal radiology

Horiuchi, D.; Tatekawa, H.; Oura, T.; Shimono, T.; Walston, S. L.; Takita, H.; Matsushita, S.; Mitsuyama, Y.; Miki, Y.; Ueda, D.

2023-12-09 radiology and imaging 10.1101/2023.12.07.23299707
Top 0.1%
157× avg

Objective: To compare the diagnostic accuracy of Generative Pre-trained Transformer (GPT)-4 based ChatGPT, GPT-4 with vision (GPT-4V) based ChatGPT, and radiologists in musculoskeletal radiology.
Materials and Methods: We included 106 "Test Yourself" cases from Skeletal Radiology published between January 2014 and September 2023. We input the medical history and imaging findings into GPT-4 based ChatGPT and the medical history and images into GPT-4V based ChatGPT; each then generated a diagnosis for every case. Two radiologists (a radiology resident and a board-certified radiologist) independently provided diagnoses for all cases. Diagnostic accuracy rates were determined against the published ground truth. Chi-square tests were performed to compare the diagnostic accuracy of GPT-4 based ChatGPT, GPT-4V based ChatGPT, and the radiologists.
Results: GPT-4 based ChatGPT significantly outperformed GPT-4V based ChatGPT (p < 0.001), with accuracy rates of 43% (46/106) and 8% (9/106), respectively. The radiology resident and the board-certified radiologist achieved accuracy rates of 41% (43/106) and 53% (56/106). The diagnostic accuracy of GPT-4 based ChatGPT was comparable to that of the radiology resident but lower than that of the board-certified radiologist, although the differences were not significant (p = 0.78 and 0.22, respectively). The diagnostic accuracy of GPT-4V based ChatGPT was significantly lower than that of both radiologists (p < 0.001 for each).
Conclusion: GPT-4 based ChatGPT demonstrated significantly higher diagnostic accuracy than GPT-4V based ChatGPT. While GPT-4 based ChatGPT's diagnostic performance was comparable to the radiology resident's, it did not reach the performance level of the board-certified radiologist in musculoskeletal radiology.

4
Assessing Performance of Multimodal ChatGPT-4 on an image based Radiology Board-style Examination: An exploratory study

Bera, K.; Gupta, A.; Jiang, S.; Berlin, S.; Faraji, N.; Tippareddy, C.; Chiong, I.; Jones, R.; Nemer, O.; Nayate, A.; Tirumani, S. H.; Ramaiya, N.

2024-01-13 radiology and imaging 10.1101/2024.01.12.24301222
Top 0.1%
157× avg

Objective: To evaluate the performance of multimodal ChatGPT-4 on a radiology board-style examination containing text and radiologic images.
Materials and Methods: In this prospective exploratory study conducted from October 30 to December 10, 2023, 110 multiple-choice questions containing images, designed to match the style and content of radiology board examinations such as the American Board of Radiology Core or Canadian Board of Radiology examination, were prompted to multimodal ChatGPT-4. Questions were further sub-stratified by order of thinking (lower-order: recall, understanding; higher-order: analyze, synthesize), domain (radiology subspecialty), imaging modality, and difficulty (rated by both radiologists and radiologists-in-training). ChatGPT performance was assessed overall and within subcategories using Fisher's exact test with multiple comparisons. Confidence in answering questions was assessed on a Likert scale (1-5) by consensus between a radiologist and a radiologist-in-training. Reproducibility was assessed by comparing two different runs using two different accounts.
Results: ChatGPT-4 answered 55% (61/110) of image-rich questions correctly. While there was no significant difference in performance among the various subgroups on exploratory analysis, performance was better on lower-order questions [61% (25/41)] than on higher-order questions [52% (36/69)] [P = .46]. Among clinical domains, performance was best on cardiovascular imaging [80% (8/10)] and worst on thoracic imaging [30% (3/10)]. ChatGPT was confident or highly confident in 89% (98/110) of its answers, even when incorrect. Reproducibility between the two runs was poor, with answers differing on 14% (15/110) of questions.
Conclusion: Despite no radiology-specific pretraining, the multimodal capabilities of ChatGPT appear promising on questions containing images. However, the lack of reproducibility between two runs, even with identical questions, poses challenges for reliability.
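A minimal sketch of the Fisher's exact comparison this abstract describes, using the lower-order versus higher-order counts it reports (25/41 vs. 36/69); the call shown is generic, not the authors' pipeline.

    # Minimal sketch: Fisher's exact test on lower-order vs. higher-order
    # accuracy, with counts from the abstract (25/41 vs. 36/69).
    from scipy.stats import fisher_exact

    table = [[25, 41 - 25],   # lower-order: correct, incorrect
             [36, 69 - 36]]   # higher-order: correct, incorrect
    odds_ratio, p = fisher_exact(table)
    print(f"odds ratio = {odds_ratio:.2f}, p = {p:.3f}")  # abstract: P = .46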

5
Artificial intelligence-generated smart impression from 9.8-million radiology reports as training datasets from multiple sites and imaging modalities

Kaviani, P.; Kalra, M. K.; Digumarthy, S. R.; Rodriguez, K.; Agarwal, S.; Brooks, R.; En, S.; Alkasab, T.; Bizzo, B. C.; Dreyer, K. J.

2024-03-09 radiology and imaging 10.1101/2024.03.07.24303787
Top 0.1%
156× avg

Importance: Automatic generation of the impression section of radiology reports can make radiologists more efficient and help avoid reporting errors.
Objective: To evaluate the relationship, content, and accuracy of Powerscribe Smart Impression (PSI) output against the radiologist-dictated findings and impression (RDF).
Design, Setting, and Participants: This institutional review board-approved retrospective study developed and trained a PSI algorithm (Nuance Communications, Inc.) on 9.8 million radiology reports from multiple sites to generate impressions from information including the protocol name and the radiologist-dictated findings section. Three radiologists assessed 3,879 radiology reports across multiple imaging modalities from 8 US imaging sites. For each report, we assessed whether PSI accurately reproduced the RDF in terms of the number of clinically significant findings and the radiologist's style of reporting, while avoiding mismatches with the findings section (in size, location, or laterality). Separately, we recorded the word counts of PSI and RDF. Data were analyzed with Pearson correlation and paired t-tests.
Main Outcomes and Measures: The data were ground-truthed by three radiologists. Each radiologist recorded the frequency of incidental/significant findings, any inconsistency between the RDF and PSI, and a stylistic and overall evaluation of PSI. Area under the curve (AUC), correlation coefficients, and percentages were calculated.
Results: PSI reports were deemed either perfect (91.9%) or acceptable (7.68%) for stylistic concurrence with RDF. PSI and RDF each had one mismatch (a mismatched Haller index and a mismatched nodule size, respectively). There was no difference between the word counts of PSI (mean 33 ± 23 words/impression) and RDF (mean 35 ± 24 words/impression) (p > 0.1). Overall, there was an excellent correlation (r = 0.85) between PSI and RDF for the evolution of findings (negative vs. stable vs. new or increasing vs. resolved or decreasing findings). The PSI outputs requiring major changes (2%) pertained to reports with multiple impression items.
Conclusion and Relevance: In clinical settings of radiology exam interpretation, the Powerscribe Smart Impression assessed in our study can save interpretation time; a comprehensive findings section yields the best PSI output.
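The word-count analysis pairs each PSI impression with its RDF counterpart, so Pearson correlation and a paired t-test (both named in the abstract) apply directly. A hypothetical sketch with simulated word counts:

    # Hypothetical sketch: Pearson correlation and paired t-test between AI
    # (PSI) and radiologist (RDF) impression word counts. Data are simulated
    # around the abstract's means (33 +/- 23 vs. 35 +/- 24 words).
    import numpy as np
    from scipy.stats import pearsonr, ttest_rel

    rng = np.random.default_rng(0)
    rdf_words = rng.normal(35, 24, size=200).clip(min=1)
    psi_words = (rdf_words + rng.normal(-2, 5, size=200)).clip(min=1)
    r, p_corr = pearsonr(psi_words, rdf_words)
    t, p_paired = ttest_rel(psi_words, rdf_words)
    print(f"Pearson r = {r:.2f} (p = {p_corr:.3g}); paired t = {t:.2f} (p = {p_paired:.3f})")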

6
Impact of Multimodal Prompt Elements on Diagnostic Performance of GPT-4(V) in Challenging Brain MRI Cases

Schramm, S.; Preis, S.; Metz, M.-C.; Jung, K.; Schmitz-Koep, B.; Zimmer, C.; Wiestler, B.; Hedderich, D. M.; Kim, S. H.

2024-03-06 radiology and imaging 10.1101/2024.03.05.24303767
Top 0.1%
156× avg

Background: Recent studies have explored the application of multimodal large language models (LLMs) in radiological differential diagnosis. Yet how different multimodal input combinations affect diagnostic performance is not well understood.
Purpose: To evaluate the impact of varying multimodal input elements on the accuracy of GPT-4(V)-based brain MRI differential diagnosis.
Methods: Thirty brain MRI cases with a challenging yet verified diagnosis were selected. Seven prompt groups with variations of four input elements (image, image annotation, medical history, image description) were defined. For each MRI case and prompt group, three identical queries were performed using an LLM-based search engine (© PerplexityAI, powered by GPT-4(V)). The accuracy of LLM-generated differential diagnoses was rated using a binary and a numeric scoring system and analyzed using a chi-square test and a Kruskal-Wallis test. Results were corrected for false discovery rate using the Benjamini-Hochberg procedure. Regression analyses were performed to determine the contribution of each individual input element to diagnostic performance.
Results: The prompt group containing an annotated image, medical history, and image description as input exhibited the highest diagnostic accuracy (67.8% correct responses). Significant differences were observed between prompt groups, especially between groups that contained the image description among their inputs and those that did not. Regression analyses confirmed a large positive effect of the image description on diagnostic accuracy (p << 0.001) and a moderate positive effect of the medical history (p < 0.001). The presence of unannotated or annotated images had only minor or insignificant effects on diagnostic accuracy.
Conclusion: The textual description of radiological image findings was identified as the strongest contributor to the performance of GPT-4(V) in brain MRI differential diagnosis, followed by the medical history. The unannotated or annotated image alone yielded very low diagnostic performance. These findings offer guidance on the effective use of multimodal LLMs in clinical practice.
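For the multiple-comparison step, the Benjamini-Hochberg procedure named in the abstract is available off the shelf; a minimal sketch with invented p-values (only the method, not the data, comes from the study):

    # Minimal sketch: false-discovery-rate correction with Benjamini-Hochberg,
    # as named in the abstract. The raw p-values are invented for illustration.
    from statsmodels.stats.multitest import multipletests

    raw_p = [0.001, 0.012, 0.034, 0.21, 0.44]  # hypothetical per-comparison p-values
    reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
    for p, pa, r in zip(raw_p, p_adj, reject):
        print(f"raw p = {p:.3f} -> adjusted p = {pa:.3f}, reject = {r}")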

7
Evaluation of an artificial intelligence model for identification of intracranial hemorrhage subtypes on computed tomography of the head

Hillis, J. M.; Bizzo, B. C.; Newbury-Chaet, I.; Mercaldo, S. F.; Chin, J. K.; Ghatak, A.; Halle, M. A.; L'Italien, E.; MacDonald, A. L.; Schultz, A. S.; Buch, K.; Conklin, J.; Pomerantz, S.; Rincon, S.; Dreyer, K. J.; Mehan, W. A.

2023-09-08 radiology and imaging 10.1101/2023.09.07.23295189
Top 0.1%
156× avg

Importance: Intracranial hemorrhage is a critical finding on computed tomography (CT) of the head.
Objective: This study compared the accuracy of an AI model (Annalise Enterprise CTB) with consensus neuroradiologist interpretations in detecting four hemorrhage subtypes: acute subdural/epidural hematoma, acute subarachnoid hemorrhage, intra-axial hemorrhage, and intraventricular hemorrhage.
Design: A retrospective standalone performance assessment was conducted on datasets of non-contrast head CT cases acquired between 2016 and 2022 for each hemorrhage subtype.
Setting: The cases were obtained from five hospitals in the United States.
Participants: The cases were obtained from patients aged 18 years or older. Positive cases were selected based on the original clinical reports using natural language processing and manual confirmation. Negative cases were selected by taking the next negative case acquired from the same CT scanner after each positive case.
Interventions: Each case was interpreted independently by up to three neuroradiologists to establish consensus interpretations, and then by the AI model for the presence of the relevant hemorrhage subtype. The neuroradiologists were provided with the entire CT study; the AI model separately received thin (≤1.5 mm) and/or thick (>1.5 and ≤5 mm) axial series.
Results: The four cohorts included 571 cases for acute subdural/epidural hematoma, 310 for acute subarachnoid hemorrhage, 926 for intra-axial hemorrhage, and 199 for intraventricular hemorrhage. The AI model identified acute subdural/epidural hematoma with an area under the curve (AUC) of 0.973 (95% confidence interval (CI), 0.958-0.984) on thin series and 0.942 (95% CI, 0.921-0.959) on thick series; acute subarachnoid hemorrhage with AUC 0.993 (95% CI, 0.984-0.998) on thin series and 0.966 (95% CI, 0.945-0.983) on thick series; intra-axial hemorrhage with AUC 0.969 (95% CI, 0.956-0.980) on thin series and 0.966 (95% CI, 0.953-0.976) on thick series; and intraventricular hemorrhage with AUC 0.987 (95% CI, 0.969-0.997) on thin series and 0.983 (95% CI, 0.968-0.994) on thick series. Each finding had at least one operating point with sensitivity and specificity both greater than 80%.
Conclusions and Relevance: The assessed AI model accurately identified intracranial hemorrhage subtypes in this CT dataset. Its use could assist the clinical workflow, especially by enabling triage of abnormal CTs.
Key Points: Question: Does a commercial artificial intelligence model accurately identify intracranial hemorrhage subtypes on computed tomography (CT) of the head? Findings: This retrospective study used non-contrast CT studies to compare artificial intelligence model outputs with consensus neuroradiologist interpretations. The model was provided with either thin (≤1.5 mm) or thick (>1.5 and ≤5 mm) series. It detected each of acute subdural/epidural hematoma, acute subarachnoid hemorrhage, intra-axial hemorrhage, and intraventricular hemorrhage with sensitivity and specificity greater than 80%. Meaning: This artificial intelligence model could assist radiologists through its accurate detection of intracranial hemorrhage subtypes.
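AUC with a confidence interval is the headline metric here; below is a hypothetical sketch on simulated scores. The study's CI method is not stated in the abstract, so the percentile bootstrap shown is an assumption.

    # Hypothetical sketch: AUC with a percentile-bootstrap 95% CI on simulated
    # model scores. The study's exact CI method is not stated in the abstract.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=571)  # cohort size borrowed from the SDH/EDH arm
    scores = np.where(y == 1, rng.normal(0.8, 0.2, 571), rng.normal(0.2, 0.2, 571))
    auc = roc_auc_score(y, scores)
    boot = []
    for _ in range(1000):
        idx = rng.integers(0, len(y), len(y))  # resample cases with replacement
        boot.append(roc_auc_score(y[idx], scores[idx]))
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"AUC = {auc:.3f} (95% CI {lo:.3f}-{hi:.3f})")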

8
Comparative Analysis of ChatGPT's Diagnostic Performance with Radiologists Using Real-World Radiology Reports of Brain Tumors

Mitsuyama, Y.; Tatekawa, H.; Takita, H.; Sasaki, F.; Tashiro, A.; Satoshi, O.; Walston, S. L.; Miki, Y.; Ueda, D.

2023-10-28 radiology and imaging 10.1101/2023.10.27.23297585
Top 0.1%
156× avg

Background: Large language models like Chat Generative Pre-trained Transformer (ChatGPT) have demonstrated potential for differential diagnosis in radiology. Previous studies investigating this potential primarily utilized quizzes from academic journals, which may not accurately represent real-world clinical scenarios.
Purpose: This study aimed to assess the diagnostic capabilities of ChatGPT using actual clinical radiology reports of brain tumors and to compare its performance with that of neuroradiologists and general radiologists.
Methods: We consecutively collected brain MRI reports from preoperative brain tumor patients at Osaka Metropolitan University Hospital from January to December 2021. ChatGPT and five radiologists were presented with the same findings from the reports and asked to suggest differential and final diagnoses. The pathological diagnosis of the excised tumor served as the ground truth. Chi-square tests and Fisher's exact test were used for statistical analysis.
Results: In an analysis of 99 radiological reports, ChatGPT achieved a final diagnostic accuracy of 75% (95% CI: 66, 83%), while the radiologists' accuracy ranged from 64% to 82%. ChatGPT's final diagnostic accuracy using reports from neuroradiologists was higher, at 82% (95% CI: 71, 89%), compared with 52% (95% CI: 33, 71%) using those from general radiologists (p = 0.012). For differential diagnoses, ChatGPT's accuracy was 95% (95% CI: 91, 99%), while the radiologists' fell between 74% and 88%. Notably, for differential diagnoses, ChatGPT's accuracy remained consistent whether reports came from neuroradiologists (96%, 95% CI: 89, 99%) or general radiologists (91%, 95% CI: 73, 98%) (p = 0.33).
Conclusion: ChatGPT exhibited good diagnostic capability, comparable to neuroradiologists, in differentiating brain tumors from MRI reports. ChatGPT can serve as a second opinion for neuroradiologists on final diagnoses and as a guidance tool for general radiologists and residents, especially for understanding diagnostic cues and handling challenging cases.
Summary: This study evaluated ChatGPT's diagnostic capabilities using real-world clinical MRI reports from brain tumor cases, revealing that its accuracy in interpreting brain tumors from MRI findings is competitive with radiologists.
Key Results: ChatGPT demonstrated a diagnostic accuracy of 75% for final diagnoses based on preoperative MRI findings from 99 brain tumor cases, competing favorably with five radiologists whose accuracies ranged between 64% and 82%. For differential diagnoses, ChatGPT achieved a remarkable 95% accuracy, outperforming several of the radiologists. Reports from neuroradiologists and general radiologists yielded differing accuracy when input into ChatGPT: neuroradiologists' reports produced higher accuracy for final diagnoses, while there was no difference for differential diagnoses.
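The confidence intervals quoted for accuracies such as 75% (66-83%) on 99 cases are consistent with a standard binomial interval. A minimal sketch, assuming 74/99 correct and a Wilson interval (one common choice; the authors' exact method is not stated):

    # Minimal sketch: 95% CI for diagnostic accuracy as a binomial proportion.
    # 74/99 ~ 75% matches the abstract's headline figure; the Wilson method is
    # an assumption, one common choice among binomial interval methods.
    from statsmodels.stats.proportion import proportion_confint

    correct, n = 74, 99
    low, high = proportion_confint(correct, n, alpha=0.05, method="wilson")
    print(f"accuracy = {correct / n:.0%}, 95% CI: {low:.0%}-{high:.0%}")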

9
Distinguishing GPT-4-generated Radiology Abstracts from Original Abstracts: Performance of Blinded Human Observers and AI Content Detector

Ufuk, F.; Peker, H.; Sagtas, E.; Yagci, A. B.

2023-05-03 radiology and imaging 10.1101/2023.04.28.23289283
Top 0.1%
155× avg

Objective: To determine GPT-4's effectiveness in writing scientific radiology article abstracts and to investigate how successfully human reviewers and an AI content detector distinguish these abstracts. Additionally, to determine the similarity scores of abstracts generated by GPT-4, to better understand its ability to create unique text.
Methods: The study collected 250 original articles published between 2021 and 2023 in five radiology journals. The articles were randomly selected, and their abstracts were generated by GPT-4 using a specific prompt. Three experienced academic radiologists independently evaluated the GPT-4 generated and original abstracts, classifying each as original or GPT-4 generated. All abstracts were also uploaded to an AI content detector and a plagiarism detector to calculate similarity scores. Statistical analysis was performed to determine discrimination performance and similarity scores.
Results: Of 134 GPT-4 generated abstracts, an average of 75 (56%) were detected by the reviewers, and an average of 50 (43%) original abstracts were falsely categorized as GPT-4 generated. The sensitivity, specificity, accuracy, PPV, and NPV of the observers in distinguishing GPT-4 written abstracts ranged from 51.5% to 55.6%, 56.1% to 70%, 54.8% to 60.8%, 41.2% to 76.7%, and 47% to 62.7%, respectively. No significant difference in discrimination performance was observed between the observers.
Conclusion: GPT-4 can generate convincing scientific radiology article abstracts. However, human reviewers and AI content detectors have difficulty distinguishing GPT-4 generated abstracts from original ones.
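The reader-performance metrics reported here all derive from a 2×2 confusion table; a minimal sketch (the counts are hypothetical, since the abstract reports per-reviewer ranges rather than full tables):

    # Minimal sketch: sensitivity, specificity, accuracy, PPV, and NPV from
    # 2x2 counts. Counts are hypothetical; the abstract gives only ranges.
    def reader_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
        return {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "accuracy": (tp + tn) / (tp + fp + fn + tn),
            "ppv": tp / (tp + fp),
            "npv": tn / (tn + fn),
        }

    # Hypothetical reviewer: 75 of 134 generated abstracts flagged (56%), and
    # 50 of an assumed 116 original abstracts falsely flagged.
    print(reader_metrics(tp=75, fp=50, fn=59, tn=66))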

10
Comparison of the Diagnostic Performance from Patient's Medical History and Imaging Findings between GPT-4 based ChatGPT and Radiologists in Challenging Neuroradiology Cases

Horiuchi, D.; Tatekawa, H.; Oura, T.; Oue, S.; Walston, S. L.; Takita, H.; Matsushita, S.; Mitsuyama, Y.; Shimono, T.; Miki, Y.; Ueda, D.

2023-08-29 radiology and imaging 10.1101/2023.08.28.23294607
Top 0.1%
153× avg

Purpose: To compare the diagnostic performance of Chat Generative Pre-trained Transformer (ChatGPT), based on the GPT-4 architecture, with that of radiologists given patients' medical histories and imaging findings in challenging neuroradiology cases.
Methods: We collected 30 consecutive "Freiburg Neuropathology Case Conference" cases from the journal Clinical Neuroradiology, published between March 2016 and June 2023. GPT-4 based ChatGPT generated diagnoses from each patient's provided medical history and imaging findings, and its diagnostic accuracy rate was determined against the published ground truth. Three radiologists with different levels of experience (2, 4, and 7 years, respectively) independently reviewed all cases based on the same information, and their diagnostic accuracy rates were evaluated. Chi-square tests were performed to compare the diagnostic accuracy rates of ChatGPT and each radiologist.
Results: ChatGPT achieved an accuracy rate of 23% (7/30 cases). The radiologists achieved the following accuracy rates: the junior radiology resident, 27% (8/30); the senior radiology resident, 30% (9/30); and the board-certified radiologist, 47% (14/30). ChatGPT's diagnostic accuracy rate was lower than that of each radiologist, although the differences were not significant (p = 0.99, 0.77, and 0.10, respectively).
Conclusion: The diagnostic performance of GPT-4 based ChatGPT did not reach the level of the junior/senior radiology residents or the board-certified radiologist in challenging neuroradiology cases. While ChatGPT holds great promise in neuroradiology, radiologists should be aware of its current performance and limitations for optimal utilization.

11
Detection of Lung Cancer Cases in Chest CT Scans Utilizing Artificial Intelligence: A Retrospective Analysis of Data During the COVID-19 Pandemic

Zukov, R. A.; Safontsev, I. P.; Klimenok, M. P.; Zabrodskaya, T. E.; Merkulova, N. A.; Chernina, V. Y.; Belyaev, M. G.; Goncharov, M. Y.; Omelyanovskiy, V. V.; Ulianova, K. A.; Soboleva, E. A.; Donskova, M. A.; Blokhina, M. E.; Nalivkina, E. A.; Gombolevskiy, V. A.

2023-12-29 radiology and imaging 10.1101/2023.12.26.23299170
Top 0.1%
137× avg

Purpose: To evaluate the potential of artificial intelligence (AI)-based pulmonary nodule search on chest CT data obtained during the COVID-19 pandemic to identify lung cancer (LC) patients.
Methods: This multicenter, retrospective study in the Krasnoyarsk region, Russia, analyzed CTs of COVID-19 patients using an automated algorithm, Chest-IRA by IRA Labs. Pulmonary nodules larger than 100 mm3 were identified by the AI and assessed by four radiologists, who categorized them into three groups: "high probability of LC", "insufficiently convincing evidence of LC", and "without evidence of LC". Patients with findings were analyzed by radiologists and checked against the state cancer registry and electronic medical records. Patients with confirmed findings not present in the cancer registry were invited for chest CT, and verification was performed according to the decision of a medical consilium. The study also estimated the economic impact of the AI by considering labor costs and the savings from treating patients in early rather than late stages, taking into account the life years saved and their potential contribution to the gross regional product.
Results: The AI identified lung nodules in 484 of 10,500 chest CTs. Of the 484, 355 could be evaluated; the remaining 129 had de-anonymization problems and were excluded. Of the 355, 252 cases had a high or intermediate probability of LC; 103 of these were found to be false positives. Of the 252, 100 were histologically verified LC cases: 35 in stages I-II and 65 in stages III-IV. Two lung cancers were diagnosed for the first time. Using AI instead of CT review by radiologists would save 2.43 million rubles (23,786 EUR / 26,690 USD / 196,536 CNY) in direct salary, with expected savings to the regional budget of 8.22 million rubles (80,463 EUR / 90,466 USD / 666,162 CNY). The financial equivalent of the life years saved was 173.25 million rubles (1,695,892 EUR / 1,905,750 USD / 14,033,250 CNY). The total effect over five years is estimated at 183.9 million rubles (1,800,142 EUR / 2,022,907 USD / 14,895,949 CNY).
Conclusion: Using AI to evaluate large volumes of chest CTs performed for reasons unrelated to lung cancer screening may facilitate early and cost-effective detection of incidental pulmonary nodules that might otherwise be missed.

12
Assessing accuracy and legitimacy of multimodal large language models on Japan Diagnostic Radiology Board Examination

Hirano, Y.; Miki, S.; Yamagishi, Y.; Hanaoka, S.; Nakao, T.; Kikuchi, T.; Nakamura, Y.; Nomura, Y.; Yoshikawa, T.; Abe, O.

2025-06-23 radiology and imaging 10.1101/2025.06.23.25329534
Top 0.1%
137× avg

Purpose: To assess and compare the accuracy and legitimacy of multimodal large language models (LLMs) on the Japan Diagnostic Radiology Board Examination (JDRBE).
Materials and Methods: The dataset comprised questions from JDRBE 2021, 2023, and 2024, with ground-truth answers established through consensus among multiple board-certified diagnostic radiologists. Questions without associated images and those lacking unanimous agreement on answers were excluded. Eight LLMs were evaluated: GPT-4 Turbo, GPT-4o, GPT-4.5, GPT-4.1, o3, o4-mini, Claude 3.7 Sonnet, and Gemini 2.5 Pro. Each model was evaluated under two conditions: with image input (vision) and without (text-only). Performance differences between the conditions were assessed using McNemar's exact test. Two diagnostic radiologists (with 2 and 18 years of experience) independently rated the legitimacy of responses from four models (GPT-4 Turbo, Claude 3.7 Sonnet, o3, and Gemini 2.5 Pro) on a five-point Likert scale, blinded to model identity. Legitimacy scores were analyzed using Friedman's test, followed by pairwise Wilcoxon signed-rank tests with Holm correction.
Results: The dataset included 233 questions. Under the vision condition, o3 achieved the highest accuracy at 72%, followed by o4-mini (70%) and Gemini 2.5 Pro (70%). Under the text-only condition, o3 topped the list with an accuracy of 67%. Adding image input significantly improved the accuracy of two models (Gemini 2.5 Pro and GPT-4.5), but not the others. Both o3 and Gemini 2.5 Pro received significantly higher legitimacy scores than GPT-4 Turbo and Claude 3.7 Sonnet from both raters.
Conclusion: Recent multimodal LLMs, particularly o3 and Gemini 2.5 Pro, have demonstrated remarkable progress on JDRBE questions, reflecting their rapid evolution in diagnostic radiology.
Secondary Abstract: Eight multimodal large language models were evaluated on the Japan Diagnostic Radiology Board Examination. OpenAI's o3 and Google DeepMind's Gemini 2.5 Pro achieved high accuracy rates (72% and 70%) and received good legitimacy scores from human raters, demonstrating steady progress.
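The legitimacy-score analysis chains three standard tests: an omnibus Friedman test, pairwise Wilcoxon signed-rank tests, and Holm correction. A hypothetical sketch with invented Likert ratings:

    # Hypothetical sketch: Friedman test across four models' per-question
    # legitimacy ratings, then pairwise Wilcoxon tests with Holm correction,
    # as described in the abstract. Ratings are invented 5-point scores.
    from itertools import combinations
    import numpy as np
    from scipy.stats import friedmanchisquare, wilcoxon
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(0)
    models = ["GPT-4 Turbo", "Claude 3.7 Sonnet", "o3", "Gemini 2.5 Pro"]
    scores = {m: rng.integers(1, 6, size=50) for m in models}  # Likert 1-5

    stat, p = friedmanchisquare(*scores.values())
    print(f"Friedman: stat = {stat:.2f}, p = {p:.3f}")

    pairs = list(combinations(models, 2))
    raw_p = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]
    _, p_holm, _, _ = multipletests(raw_p, method="holm")
    for (a, b), ph in zip(pairs, p_holm):
        print(f"{a} vs {b}: Holm-adjusted p = {ph:.3f}")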

13
Performance and Robustness of Machine Learning-based Radiomic COVID-19 Severity Prediction

Yip, S. S. F.; Klanecek, Z.; Naganawa, S.; Kim, J.; Studen, A.; Rivetti, L.; Jeraj, R.

2020-09-09 radiology and imaging 10.1101/2020.09.07.20189977
Top 0.1%
136× avg

Objectives: This study investigated the performance and robustness of radiomics in predicting COVID-19 severity in a large public cohort.
Methods: A public dataset of 1110 COVID-19 patients (1 CT/patient) was used. Using CTs and clinical data, each patient was classified as mild, moderate, or severe by two observers: (1) the dataset provider and (2) a board-certified radiologist. For each CT, 107 radiomic features were extracted. The dataset was randomly divided into a training (60%) and holdout validation (40%) set. During training, features were selected and combined into a logistic regression model for predicting severe cases from mild and moderate cases. The models were trained and validated on the classifications of both observers. AUC quantified the predictive power of the models. To determine model robustness, the trained models were cross-validated on the inter-observer classifications.
Results: A single feature alone was sufficient to predict mild from severe COVID-19 with [Formula] and [Formula] (p << 0.01). The most predictive features were the distribution of small size-zones (GLSZM-SmallAreaEmphasis) for the provider's classification and the linear dependency of neighboring voxels (GLCM-Correlation) for the radiologist's classification. Cross-validation showed that both [Formula]. In predicting moderate from severe COVID-19, first-order Median alone had sufficient predictive power of [Formula]. For the radiologist's classification, the predictive power of the model increased to [Formula] as the number of features grew from 1 to 5. Cross-validation yielded [Formula] and [Formula].
Conclusions: Radiomics significantly predicted different levels of COVID-19 severity. The prediction was moderately sensitive to inter-observer classifications and thus needs to be used with caution.
Key Points: Interpretable radiomic features can predict different levels of COVID-19 severity. Machine learning-based radiomic models were moderately sensitive to inter-observer classifications and thus need to be used with caution.
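A minimal sketch of the modeling pipeline this abstract describes: feature selection feeding a logistic regression, trained on a 60/40 split and scored by AUC. The feature matrix is simulated; only the shapes and split mirror the study.

    # Hypothetical sketch: feature selection + logistic regression on a 60/40
    # split, scored by AUC. Data are simulated stand-ins (1110 x 107).
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1110, 107))    # 1110 patients x 107 radiomic features
    y = rng.integers(0, 2, size=1110)   # severe vs. mild/moderate (simulated)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.4, random_state=0)

    model = make_pipeline(
        StandardScaler(),
        SelectKBest(f_classif, k=5),    # abstract: models of 1 to 5 features
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"holdout AUC = {auc:.2f}")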

14
Diagnostic Performance of Claude 3 from Patient History and Key Images in Diagnosis Please Cases.

Kurokawa, R.; Ohizumi, Y.; Kanzawa, J.; Kurokawa, M.; Kiguchi, T.; Gonoi, W.; Abe, O.

2024-04-14 radiology and imaging 10.1101/2024.04.11.24305622
Top 0.1%
134× avg

Background: Large language AI models have shown diagnostic performance based solely on textual information from clinical history and imaging findings. However, their performance when supplied with radiological images, and their ability to provide differential diagnoses, had yet to be investigated.
Purpose: We employed the latest version of Claude 3, Opus, released on March 4, 2024, to investigate its diagnostic performance on Radiology's "Diagnosis Please" quiz questions under three conditions: (1) clinical history alone; (2) clinical history with imaging findings; and (3) clinical history with key images. Furthermore, we evaluated the model's diagnostic performance when instructed to list differential diagnoses.
Materials and Methods: Claude 3 Opus was tasked with listing the primary diagnosis and two differential diagnoses for 322 quiz questions from Radiology's "Diagnosis Please" cases (cases 1 to 322, published from 1998 to 2023). The analyses were carried out under the following input conditions: Condition 1: submitter-provided clinical history (text) alone; Condition 2: submitter-provided clinical history and imaging findings (text); Condition 3: submitter-provided clinical history (text) and key images (PDF files). We applied McNemar's tests to evaluate differences in correct response rates for primary diagnoses across Conditions 1, 2, and 3.
Results: The correct primary diagnosis rates were 62/322 (19.3%), 178/322 (55.3%), and 93/322 (28.8%) for Conditions 1, 2, and 3, respectively. Additionally, Claude 3 Opus provided the correct answer as a differential diagnosis in up to 22/322 (6.8%) of cases. There were statistically significant differences in correct response rates for primary diagnoses between all combinations of Conditions 1, 2, and 3 (p < 0.001).
Conclusion: Claude 3 Opus demonstrated significantly improved diagnostic performance when key images were input in addition to clinical history. Its ability to list important differential diagnoses was also confirmed.
Key Results: This study investigated Claude 3 Opus's performance on Radiology "Diagnosis Please" cases using clinical history, key images, and imaging findings. Adding key images or imaging findings significantly improved correct primary diagnoses from 19.3% to 28.8% or 55.3%, respectively. Presenting two additional differential diagnoses improved total correct responses by 3.1-6.8%.
Summary Statement: The large language AI model Claude 3 Opus demonstrated significantly improved diagnostic accuracy when key images were added to clinical history, compared with clinical history alone.
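For a Condition 1-style query (clinical history only), here is a minimal sketch using the Anthropic Python SDK; the prompt wording and parameters are assumptions, as the abstract does not give the study's exact prompt.

    # Hypothetical sketch of a Condition 1-style query (clinical history only)
    # via the Anthropic Python SDK. Prompt text and max_tokens are assumptions.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": "Clinical history: <submitter-provided text>\n"
                       "List the most likely primary diagnosis and two differential diagnoses.",
        }],
    )
    print(response.content[0].text)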

15
From Community Acquired Pneumonia to COVID-19: A Deep Learning Based Method for Quantitative Analysis of COVID-19 on thick-section CT Scans

Li, Z.; Zhong, Z.; Li, Y.; Zhang, T.; Gao, L.; Jin, D.; Sun, Y.; Ye, X.; Yu, L.; Hu, Z.; Xiao, J.; Huang, L.; Tang, Y.

2020-04-23 radiology and imaging 10.1101/2020.04.17.20070219
Top 0.1%
134× avg

Background: Thick-section CT scanners are more affordable for developing countries. Given the wide spread of COVID-19, it is of great benefit to develop an automated and accurate system for quantifying COVID-19-associated lung abnormalities on thick-section chest CT images.
Purpose: To develop a fully automated AI system to quantitatively assess disease severity and disease progression using thick-section chest CT images.
Materials and Methods: In this retrospective study, a deep learning based system was developed to automatically segment and quantify the COVID-19 infected lung regions on thick-section chest CT images. 531 thick-section CT scans from 204 patients diagnosed with COVID-19 were collected from one appointed COVID-19 hospital between 23 January 2020 and 12 February 2020. The lung abnormalities were first segmented by a deep learning model. To assess disease severity (non-severe or severe) and progression, two imaging biomarkers were automatically computed: the portion of infection (POI) and the average infection HU (iHU). Lung abnormality segmentation was evaluated using the Dice coefficient, while the assessments of disease severity and progression were evaluated using the area under the receiver operating characteristic curve (AUC) and Cohen's kappa statistic, respectively.
Results: The Dice coefficients between the AI system's segmentations and the manual delineations of two experienced radiologists for the COVID-19 infected lung abnormalities were 0.74 ± 0.28 and 0.76 ± 0.29, respectively, close to the inter-observer agreement of 0.79 ± 0.25. The two computed imaging biomarkers distinguished between the severe and non-severe stages with an AUC of 0.9680 (p-value < 0.001). Very good agreement (κ = 0.8220) between the AI system and the radiologists was achieved in evaluating changes in infection volume.
Conclusions: A deep learning based AI system built on thick-section CT imaging can accurately quantify COVID-19-associated lung abnormalities and assess disease severity and progression.
Key Results: The deep learning based AI system accurately segmented the COVID-19 infected lung regions on thick-section CT scans (Dice coefficient ≥ 0.74). The computed imaging biomarkers distinguished between the non-severe and severe COVID-19 stages (area under the receiver operating characteristic curve 0.968). The infection volume changes computed by the AI system tracked COVID-19 progression (Cohen's kappa 0.8220).
Summary Statement: A deep learning based AI system built on thick-section CT imaging can accurately quantify COVID-19 infected lung regions and assess patients' disease severity and progression.
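The Dice coefficient that anchors the segmentation results is simple to compute; a minimal sketch on hypothetical binary masks:

    # Minimal sketch of the Dice coefficient used to score segmentation
    # overlap; the masks here are hypothetical arrays, not the study's data.
    import numpy as np

    def dice(pred: np.ndarray, truth: np.ndarray) -> float:
        """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks."""
        pred, truth = pred.astype(bool), truth.astype(bool)
        denom = pred.sum() + truth.sum()
        return 1.0 if denom == 0 else 2.0 * np.logical_and(pred, truth).sum() / denom

    rng = np.random.default_rng(0)
    a = rng.integers(0, 2, size=(64, 64))
    b = rng.integers(0, 2, size=(64, 64))
    print(f"Dice = {dice(a, b):.3f}")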

16
A retrospective analysis of the diagnostic performance of an FDA approved software for the detection of intracranial hemorrhage

Pourmussa, B.; Gorovoy, D.

2023-11-03 radiology and imaging 10.1101/2023.11.02.23297974
Top 0.1%
133× avg

Objective: To determine the sensitivity, specificity, accuracy, positive predictive value (PPV), and negative predictive value (NPV) of Rapid ICH, a commercially available AI model, in detecting intracranial hemorrhage (ICH) on non-contrast computed tomography (NCCT) examinations of the head at a single regional medical center.
Methods: RapidAI's Rapid ICH is incorporated into real-time hospital workflow to assist radiologists in identifying ICH on NCCT examinations of the head. 412 examinations from August 2022 to January 2023 were pulled for analysis. Scans in which it was unclear whether ICH was present, as well as scans significantly affected by motion artifact, were excluded. The sensitivity, specificity, accuracy, PPV, and NPV of the software were then assessed retrospectively for the remaining 406 NCCT examinations, using the prior radiologist report as the ground truth. A two-tailed z-test with α = 0.05 was performed to determine whether the sensitivity and specificity of the software in this study differed significantly from Rapid ICH's reported sensitivity and specificity. Additionally, the software's performance was analyzed separately for the male and female populations, and a chi-square test of independence was used to determine whether model correctness depended significantly on sex.
Results: Of the 406 scans assessed, Rapid ICH flagged 82 as ICH positive and 324 as ICH negative. There were 80 examinations (19.7%) truly positive for ICH and 326 (80.3%) negative for ICH. This yielded a sensitivity of 71.3%, 95% CI [61.3%-81.2%]; a specificity of 92.3%, 95% CI [89.4%-95.2%]; an accuracy of 88.2%, 95% CI [85.0%-91.3%]; a PPV of 69.5%, 95% CI [59.5%-79.5%]; and an NPV of 92.9%, 95% CI [90.1%-95.7%]. Two examinations were excluded because the electronic medical record lacked information on patient sex. The resulting sensitivity was significantly different from the sensitivity reported by Rapid ICH (95%), z = 2.60, p = .009, although the resulting specificity was not significantly different from the reported specificity (94%), z = 0.65, p = .517. Model performance did not depend on sex per the chi-square test of independence: χ² (1 degree of freedom, N = 404) = 1.95, p = .162.
Conclusion: Rapid ICH demonstrates exceptional capability in the identification of ICH, but its performance when used at this site differs from the values advertised by the company and from assessments of the model's performance by other research groups. Specifically, the sensitivity of the software at this site is significantly different from the sensitivity reported by the company. These results underscore the necessity of independent evaluation of such software at the institutions where it is implemented.
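A one-sample z-test against the vendor's claimed sensitivity can be sketched as below. Note the hedge: this generic formulation will not necessarily reproduce the abstract's z = 2.60, which presumably used a different variance formulation (for example, one incorporating the vendor study's own sample size).

    # Hypothetical sketch: one-sample z-test of observed sensitivity (57/80,
    # ~71.3%) against the vendor's claimed 95%. A generic formulation that may
    # differ from the study's (which reports z = 2.60).
    from statsmodels.stats.proportion import proportions_ztest

    true_positives, positives = 57, 80
    z, p = proportions_ztest(true_positives, positives, value=0.95)
    print(f"z = {z:.2f}, p = {p:.4g}")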

17
A diagnostic and economic evaluation of the complex artificial intelligence algorithm aimed to detect 10 pathologies on the chest CT images

Chernina, V. Y.; Belyaev, M. G.; Silin, A. Y.; Avetisov, I. O.; Pyatnitskiy, I. A.; Petrash, E. A.; Basova, M. V.; Sinitsyn, V. E.; Omelyanovskiy, V. V.; Gombolevskiy, V. A.

2023-04-25 radiology and imaging 10.1101/2023.04.19.23288584
Top 0.1%
133× avg

Background: Artificial intelligence (AI) technologies can help solve the significant problem of missed findings in radiology studies. An important issue is assessing the economic benefits of implementing AI.
Aim: To evaluate the frequency of missed pathology detection and the economic potential of AI technology for chest CT, validated by expert radiologists, compared with radiologists without access to AI in a private medical center.
Methods: An observational, single-center retrospective study was conducted. The study included chest CTs without IV contrast performed from 01.06.2022 to 31.07.2022 at "Yauza Hospital" LLC, Moscow. The CTs were processed using a complex AI algorithm covering ten pathologies: pulmonary infiltrates typical of viral pneumonia (COVID-19 under pandemic conditions); lung nodules; pleural effusion; pulmonary emphysema; thoracic aortic dilatation; pulmonary trunk dilatation; coronary artery calcification; adrenal hyperplasia; and osteoporosis (vertebral body height and density changes). Two experts analyzed the CTs and compared their results with the AI. Further routing was determined according to clinical guidelines for all findings, both initially detected and missed by radiologists. The lost potential revenue (LPR) was calculated for each patient according to the hospital price list.
Results: Of the final 160 CTs, the AI identified 90 studies (56%) with pathologies, of which 81 studies (51%) were missing at least one pathology in the report. The "second-stage" LPR for all pathologies from the 81 patients was RUB 2,847,760 ($37,251 or CNY 256,218). The LPR for only those pathologies missed by radiologists but detected by AI was RUB 2,065,360 ($27,017 or CNY 185,824).
Conclusion: Using AI for chest CTs as an "assistant" to the radiologist can significantly reduce the number of missed abnormalities. AI usage can bring 3.6 times more benefit compared with the standard model without AI. The use of complex AI for chest CT can be cost-effective.

18
Feasibility and visualization of deep learning detection and classification of inferior vena cava filters

Park, B. J.; Sotirchos, V. S.; Adleberg, J.; Stavropoulos, W.; Cook, T. S.; Hunt, S. J.

2020-06-08 radiology and imaging 10.1101/2020.06.06.20124321
Top 0.1%
133× avg

Purpose: This study assesses the feasibility of deep learning detection and classification of three retrievable inferior vena cava filters with similar radiographic appearances and emphasizes the importance of visualization methods to confirm proper detection and classification.
Materials and Methods: The fast.ai library with a ResNet-34 architecture was used to train a deep learning classification model. A total of 442 fluoroscopic images (N = 144 patients) from inferior vena cava filter placement or removal were collected. Following image preprocessing, the training set included 382 images (110 Celect, 149 Denali, 123 Gunther Tulip), of which 80% were used for training and 20% for validation. Data augmentation was performed for regularization. A random test set of 60 images (20 of each filter type), not included in the training or validation set, was used for evaluation. Total accuracy and area under the receiver operating characteristic curve were used to evaluate performance. Feature heatmaps were visualized using guided backpropagation and gradient-weighted class activation mapping.
Results: The overall accuracy was 80.2% with a mean receiver operating characteristic area under the curve of 0.96 for the validation set (N = 76), and 85.0% with a mean area under the curve of 0.94 for the test set (N = 60). Two visualization methods were used to assess correct filter detection and classification.
Conclusions: A deep learning model can be used to automatically detect and accurately classify inferior vena cava filters on radiographic images. Visualization techniques should be utilized to ensure deep learning models function as intended.
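A minimal sketch of this kind of ResNet-34 transfer-learning setup in fastai. It uses the current fastai v2 API (vision_learner, fine_tune), which postdates the study, and the folder layout is assumed:

    # Hypothetical sketch of a ResNet-34 classifier in fastai, in the spirit
    # of the study's setup. Uses the current fastai v2 API, which postdates
    # the study; the images/{celect,denali,gunther_tulip}/ layout is assumed.
    from fastai.vision.all import (
        ImageDataLoaders, Resize, accuracy, aug_transforms, resnet34, vision_learner
    )

    dls = ImageDataLoaders.from_folder(
        "images", valid_pct=0.2, seed=42,   # 80/20 train/validation split
        item_tfms=Resize(224),
        batch_tfms=aug_transforms(),        # data augmentation for regularization
    )
    learn = vision_learner(dls, resnet34, metrics=accuracy)  # ImageNet-pretrained
    learn.fine_tune(5)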

19
Novel Autosegmentation Spatial Similarity Metrics Capture the Time Required to Correct Segmentations Better than Traditional Metrics in a Thoracic Cavity Segmentation Workflow

Kiser, K.; Barman, A.; Stieb, S.; Fuller, C. D.; Giancardo, L.

2020-05-18 radiology and imaging 10.1101/2020.05.14.20102103
Top 0.1%
131× avg

Introduction: Automated segmentation templates can save clinicians time compared with de novo segmentation but may still take substantial time to review and correct. It has not been thoroughly investigated which automated-versus-corrected segmentation similarity metrics best predict clinician correction time.
Materials and Methods: Bilateral thoracic cavity volumes in 329 CT scans were segmented by a U-Net-inspired deep learning segmentation tool and subsequently corrected by a fourth-year medical student. Eight spatial similarity metrics were calculated between the automated and corrected segmentations and associated with correction times using Spearman's rank correlation coefficients. Nine clinical variables were also associated with the metrics and correction times using Spearman's rank correlation coefficients or Mann-Whitney U tests.
Results: The added path length, false negative path length, and surface Dice similarity coefficient correlated better with correction time than traditional metrics, including the popular volumetric Dice similarity coefficient (ρ = 0.69, ρ = 0.65, and ρ = -0.48 versus ρ = -0.25, respectively; correlation p values < 0.001). Clinical variables poorly represented in the autosegmentation tool's training data were often associated with decreased accuracy but not necessarily with prolonged correction time.
Discussion: Metrics used to develop and evaluate autosegmentation tools should correlate with clinical time saved. To our knowledge, this is only the second investigation of which metrics correlate with time saved. Validation of our findings is indicated in other anatomic sites and clinical workflows.
Conclusion: Novel spatial similarity metrics may be preferable to traditional metrics for developing and evaluating autosegmentation tools intended to save clinicians time.
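Associating a similarity metric with correction time via Spearman's rank correlation, as the study does, is a one-liner; a sketch on simulated values:

    # Minimal sketch: Spearman's rank correlation between a similarity metric
    # (added path length) and correction time. Values are simulated.
    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    added_path_length = rng.gamma(2.0, 50.0, size=329)                 # hypothetical, mm
    correction_time = 0.1 * added_path_length + rng.normal(0, 5, 329)  # minutes
    rho, p = spearmanr(added_path_length, correction_time)
    print(f"Spearman rho = {rho:.2f}, p = {p:.3g}")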

20
Are quantitative radiomics features comparable to semantic radiology features for pre-operative risk classification of thymic epithelial tumours?

Varghese, A. J.; Pathinathan, M.; Sasidharan, B. K.; Praveenraj, C.; Kuchipudi, R. B.; Kodiatte, T.; Mathew, M.; Isiah, R.; Pavamani, S.; Irodi, A.; Wee, L.; Dekker, A.; Thomas, H. M. T.

2025-04-28 radiology and imaging 10.1101/2025.04.26.25326466
Top 0.1%
131× avg

Thymic epithelial tumours (TETs) are rare and exhibit varied behaviour and prognosis according to their histological subtype, as classified by the World Health Organization (WHO). These subtypes are further categorized into low-risk and high-risk groups. Low-risk thymomas generally allow complete surgical resection without adjuvant therapy, while high-risk types often require multimodal treatment due to their aggressive nature. This study aims to evaluate the role of CT radiomics in discriminating between high- and low-risk TETs.
Methods: This retrospective study included patients treated at a single hospital in India who underwent surgical resection of TETs from 2010 to 2024. Inclusion criteria were a confirmed TET diagnosis, pre-operative CT scans, and medical and post-operative histopathological confirmation. Conventional CT (semantic) features were manually extracted from radiology reports, while radiomic features were obtained using PyRadiomics. The data were randomly split into a training set and a hold-out validation set, stratified by class. Three classification models were evaluated, using clinical, semantic, and radiomic features respectively, with LASSO regularization. Model performance was assessed using the area under the receiver operating characteristic curve (AUC), sensitivity, and specificity on the test set.
Results: Of 195 enrolled patients, 132 met the inclusion criteria and were divided into a training set (n = 100) and a validation set (n = 32). The clinical model included age, presence of pure red cell aplasia, and weight loss, achieving an AUC of 0.69 (95% CI: 0.49-0.87), sensitivity of 0.73 (95% CI: 0.46-1.00), and specificity of 0.53 (95% CI: 0.29-0.76) on the hold-out set. The radiomics model included the 90th percentile and sphericity as key predictors, with an AUC of 0.77 (95% CI: 0.56-0.94), sensitivity of 0.82 (95% CI: 0.55-1.00), and specificity of 0.72 (95% CI: 0.52-0.91). The semantic model performed best, with an AUC of 0.82 (95% CI: 0.62-0.96), sensitivity of 0.82 (95% CI: 0.55-1.00), and specificity of 0.77 (95% CI: 0.57-0.91).
Discussion and Conclusion: The findings indicate that radiomic features could be valuable in pre-operative risk assessment for TETs. Although the semantic models based on conventional CT features demonstrated superior predictive capability, they carry a risk of subjectivity and inter-observer disagreement.
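The two concrete pieces named in the methods, PyRadiomics feature extraction and LASSO-regularized classification, can be sketched as follows. The file paths are placeholders, and the L1-penalized logistic model is one standard way to realize LASSO selection for classification, not necessarily the authors' exact implementation.

    # Hypothetical sketch: PyRadiomics extraction from a CT image/mask pair,
    # then an L1-regularized (LASSO-style) logistic classifier. Paths are
    # placeholders; the training matrix X_train is not shown.
    from radiomics import featureextractor
    from sklearn.linear_model import LogisticRegression

    extractor = featureextractor.RadiomicsFeatureExtractor()
    features = extractor.execute("patient001_ct.nrrd", "patient001_mask.nrrd")
    numeric = {k: v for k, v in features.items() if not k.startswith("diagnostics")}
    print(f"{len(numeric)} radiomic features extracted")

    # L1 regularization drives uninformative coefficients to zero, acting as
    # the feature-selection step the abstract describes.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    # clf.fit(X_train, y_train)  # per-patient feature matrix (not shown)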